Adaptive load balancing for HPC applications
One of the critical factors affecting the performance of many applications is load imbalance. Applications are becoming increasingly sophisticated, using irregular structures and adaptive refinement techniques that result in load imbalance. Moreover, systems are becoming more complex: the number of cores per node is increasing substantially, and nodes are becoming heterogeneous. High variability in the performance of hardware components introduces further imbalance. Load imbalance leads to a drop in system utilization and degrades performance. To address this problem, many HPC applications employ dynamic load balancing algorithms to redistribute work and balance the load; doing so is necessary to achieve high performance.
Different application characteristics warrant different load balancing strategies, so we need a variety of high-quality, scalable load balancing algorithms to cater to different applications. However, choosing an appropriate load balancer is not by itself sufficient for good performance, because performing load balancing incurs a cost. Moreover, the dynamic nature of applications makes it hard to decide when to perform load balancing. As a result, deciding when to load balance and which strategy to use may not be possible a priori.
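The cost tradeoff described above can be made concrete with a simple cost model. The sketch below is a hypothetical illustration, not the dissertation's actual algorithm; the function names and the specific heuristic are assumptions for exposition: rebalance only when the time projected to be lost to imbalance before the next decision point exceeds the measured cost of one load balancing step.

```python
# Hypothetical when-to-balance heuristic (illustration only): compare the
# projected time lost to imbalance against the measured load balancing cost.

def imbalance(loads):
    """Common imbalance metric: max/avg - 1 (0.0 means perfectly balanced)."""
    avg = sum(loads) / len(loads)
    return max(loads) / avg - 1.0

def should_balance(loads, steps_until_next_check, lb_cost):
    """True if expected savings over the coming interval exceed the LB cost."""
    avg = sum(loads) / len(loads)
    lost_per_step = max(loads) - avg   # time every rank waits on the slowest
    return lost_per_step * steps_until_next_check > lb_cost

per_rank_time = [1.0, 1.1, 0.9, 2.0]             # seconds per step, per rank
print(round(imbalance(per_rank_time), 3))        # 0.6
print(should_balance(per_rank_time, 100, 10.0))  # True: 0.75 * 100 > 10
```

In practice the "cost" and "growth" terms would be measured at run time rather than assumed constant, which is exactly why such decisions are hard to make a priori.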
With ever-increasing core counts per node, there will be a vast amount of on-node parallelism. Because of this, load imbalance occurring at the node level can often be mitigated within the node instead of performing a global load balancing step. However, having the application developer manage resources and handle dynamic imbalances is inefficient and places a burden on the programmer.
The focus of this dissertation is on developing scalable and adaptive techniques for handling load imbalance. The dissertation presents different load balancing algorithms for handling inter- and intra-node load imbalance. It also presents an introspective run-time system that monitors the application and system characteristics and makes load balancing decisions automatically.
HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU
The end of Dennard scaling and the slowdown of Moore's law led to a shift in
technology trends toward parallel architectures, particularly in HPC systems.
To continue providing performance benefits, HPC should embrace Approximate
Computing (AC), which trades application quality loss for improved performance.
However, existing AC techniques have not been extensively applied and evaluated
in state-of-the-art hardware architectures such as GPUs, the primary execution
vehicle for HPC applications today.
This paper presents HPAC-Offload, a pragma-based programming model that
extends OpenMP offload applications to support AC techniques, allowing portable
approximations across different GPU architectures. We conduct a comprehensive
performance analysis of HPAC-Offload across GPU-accelerated HPC applications,
revealing that AC techniques can significantly accelerate HPC applications
(1.64x for LULESH on AMD GPUs, 1.57x on NVIDIA GPUs) with minimal quality loss (0.1%). Our
analysis offers deep insights into the performance of GPU-based AC that guide
the future development of AC algorithms and systems for these architectures.
Comment: 12 pages. Accepted at SC2
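The quality-for-performance tradeoff at the heart of AC can be illustrated with loop perforation, one classic AC technique. This is a plain-Python sketch of the concept, not HPAC-Offload's pragma syntax or its actual implementation:

```python
# Loop perforation (a classic AC technique): process only every `skip`-th
# element and rescale the result, trading a small quality loss for a
# runtime reduction of roughly the same factor.

def reduce_exact(values):
    return sum(values)

def reduce_perforated(values, skip=2):
    """Sample every `skip`-th element; quality loss grows with `skip`,
    while work shrinks by about that factor."""
    sampled = values[::skip]
    return sum(sampled) * len(values) / len(sampled)

data = [float(i % 7) for i in range(10_000)]
exact = reduce_exact(data)
approx = reduce_perforated(data, skip=4)
print(abs(exact - approx) / exact)  # ~0.0002 relative error (about 0.02%)
```

Pragma-based models like HPAC-Offload let the compiler and runtime apply such transformations from annotations, so the approximation stays portable across GPU architectures instead of being hand-coded per device.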
Fast And Automatic Floating Point Error Analysis With CHEF-FP
As we reach the limit of Moore's Law, researchers are exploring different
paradigms to achieve unprecedented performance. Approximate Computing (AC),
which relies on the ability of applications to tolerate some error in the
results to trade-off accuracy for performance, has shown significant promise.
Despite the success of AC in domains such as Machine Learning, its acceptance
in High-Performance Computing (HPC) is limited due to stringent requirements
for accuracy. We need tools and techniques to identify regions of code that are
amenable to approximations and their impact on the application output quality
to guide developers to employ selective approximation. To this end, we propose
CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool
based on Automatic Differentiation (AD) for analyzing approximation errors in
HPC applications. CHEF-FP uses Clad, an efficient AD tool built as a plugin to
the Clang compiler and based on the LLVM compiler infrastructure, as a backend
and utilizes its AD abilities to evaluate approximation errors in C++ code.
CHEF-FP works at the source level by injecting error estimation code into the
generated adjoints. This enables the error-estimation code to undergo compiler
optimizations resulting in improved analysis time and reduced memory usage. We
also provide theoretical and architectural augmentations to source code
transformation-based AD tools to perform FP error analysis. This paper
primarily focuses on analyzing errors introduced by mixed-precision AC
techniques. We also show the applicability of our tool in estimating other
kinds of errors by evaluating our tool on codes that use approximate functions.
Moreover, we demonstrate the speedups CHEF-FP achieved during analysis time
compared to the existing state-of-the-art tool due to its ability to generate
and insert approximation error estimate code directly into the derivative
source.
Comment: 11 pages, to appear in the 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS'23).
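The idea behind AD-based floating-point error estimation can be sketched with forward-mode propagation: carry a first-order absolute-error bound alongside each value, using |∂f/∂x|·err(x) per operation. The toy class below is an illustration of that concept only; it is not Clad's or CHEF-FP's implementation, and the class and variable names are invented for this example:

```python
# Forward-mode propagation of a first-order FP error bound: each value
# carries an accumulated absolute error, updated via the chain rule.

class Dual:
    """A value paired with an accumulated absolute-error bound."""
    def __init__(self, val, err=0.0):
        self.val, self.err = val, err

    def __add__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + o.val, self.err + o.err)

    def __mul__(self, other):
        o = other if isinstance(other, Dual) else Dual(other)
        # d(xy) = y dx + x dy  ->  |y| * err_x + |x| * err_y
        return Dual(self.val * o.val,
                    abs(o.val) * self.err + abs(self.val) * o.err)

# A value demoted from double to float carries an initial error of
# roughly |x| * 2^-24 (single-precision machine epsilon).
x = Dual(3.14159, err=3.14159 * 2**-24)
y = Dual(2.71828, err=2.71828 * 2**-24)
f = x * y + x
print(f.val, f.err)  # result value plus a first-order error bound
```

Injecting this bookkeeping into compiler-generated adjoints, as the abstract describes, lets the error-estimation code be optimized together with the application rather than interpreted after the fact.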
Power, Reliability, Performance: One System to Rule Them All
En un dise帽o basado en el marco de programaci贸n paralelo Charm ++, un sistema de tiempo de ejecuci贸n adaptativo interact煤a din谩micamente con el administrador de recursos de un centro de datos para controlar la energ铆a mediante la programaci贸n inteligente de trabajos, la reasignaci贸n de recursos y la reconfiguraci贸n de hardware. Gestiona simult谩neamente la fiabilidad al enfriar el sistema al nivel 贸ptimo de la aplicaci贸n en ejecuci贸n y mantiene el rendimiento a trav茅s del equilibrio de carg
Applying Graph Partitioning Methods in Measurement-based Dynamic Load Balancing
Load imbalance in an application can lead to degradation of performance and a significant drop in system utilization. Achieving the best parallel efficiency for a program requires optimal load balancing, which is an NP-hard problem. This paper explores the use of graph partitioning algorithms, traditionally used for partitioning physical domains/meshes, for measurement-based dynamic load balancing of parallel applications. In particular, we present repartitioning methods that consider the previous mapping to minimize dynamic migration costs. We also discuss the use of a greedy algorithm in conjunction with iterative graph partitioning algorithms to reduce the load imbalance for graphs with heavily skewed load distributions. These algorithms are implemented in a graph partitioning toolbox called SCOTCH, and we use CHARM++, a migratable-objects-based programming model, to experiment with various load balancing scenarios. To compare with different load balancing strategies based on graph partitioners, we have implemented METIS- and ZOLTAN-based load balancers in CHARM++. We demonstrate the effectiveness of the new algorithms developed in SCOTCH in the context of the NAS BT solver and two micro-benchmarks. We show that SCOTCH-based strategies lead to better performance than other existing partitioners, both in application execution time and in the number of objects migrated.
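The greedy approach for heavily skewed load distributions can be illustrated with the classic longest-processing-time (LPT) rule: place objects heaviest-first onto the currently least-loaded processor. This is a simplified sketch of the general idea, not SCOTCH's actual algorithm, and the function name is invented for this example:

```python
# Greedy LPT placement: heaviest objects first, each onto the processor
# with the smallest load so far (kept in a min-heap).
import heapq

def greedy_map(object_loads, num_procs):
    """Return an object -> processor assignment."""
    heap = [(0.0, p) for p in range(num_procs)]   # (processor load, id)
    heapq.heapify(heap)
    assignment = {}
    for obj in sorted(object_loads, key=object_loads.get, reverse=True):
        load, p = heapq.heappop(heap)             # least-loaded processor
        assignment[obj] = p
        heapq.heappush(heap, (load + object_loads[obj], p))
    return assignment

weights = {"a": 8.0, "b": 7.0, "c": 3.0, "d": 2.0, "e": 2.0}
mapping = greedy_map(weights, 2)
print(mapping)  # per-processor loads come out as 12.0 and 10.0 (average 11.0)
```

Sorting heaviest-first matters precisely for skewed distributions: placing the few huge objects first leaves the many small ones to fill in the gaps.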
Parallel Programming with Migratable Objects: Charm++ in Practice
The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failures) for programming scalable parallel applications. The increased complexity and dynamism of today's science and engineering applications have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to the development of applications that scale irrespective of the rough landscape of supercomputing technology. The empirical evaluation presented in this paper spans many mini-applications and real applications executed on modern supercomputers, including Blue Gene/Q, Cray XE6, and Stampede.
Meta-Balancer: automated load balancing based on application behavior
With the dawn of petascale, and with exascale in the near future, it has become significantly more difficult to write parallel applications that fully exploit the available processing power and scale to large systems. Load imbalance, whether computation- or communication-induced, is one of the important challenges in achieving scalability and high performance. Problem sizes and system sizes have become so large that manually handling the imbalance in dynamic applications and finding an optimal distribution of load has become a herculean task.
Charm++ provides the user with a run-time system that performs dynamic load balancing. To enable Charm++ to perform load balancing efficiently, the user makes certain decisions, such as when to load balance and which strategy to use, and informs the Charm++ run-time system of these decisions. Often, making these important decisions involves hand-tuning each application by observing various runs of it.
In this thesis, we present Meta-Balancer, which relieves the user of the effort of making load-balancing decisions. Meta-Balancer is part of the Charm++ load balancing framework. It identifies the characteristics of the application and, based on the principle of persistence and the accrued information, makes load balancing decisions. We study the performance of Meta-Balancer in the context of the leanmd mini-application, and we also evaluate it on micro-benchmarks such as kNeighbor and jacobi2D.
We also present several new load balancing strategies that have been incorporated into Charm++ and study their impact on application performance. These new strategies are: 1) RefineSwapLB, a refinement-based load balancing strategy; 2) CommAwareRefineLB, a communication-aware refinement strategy; 3) ScotchRefineLB, a refinement-based graph partitioning strategy using Scotch, a graph partitioner; and 4) ZoltanLB, a multicast-aware load balancing strategy using Zoltan, a hypergraph partitioner.
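The refinement idea behind strategies such as RefineSwapLB can be sketched as follows. This is a hypothetical simplification, not Charm++'s actual code; the function names and tolerance are assumptions. Rather than remapping every object, a refinement strategy migrates objects off overloaded processors until each is within a tolerance of the average, minimizing the number of migrations:

```python
# Refinement-style load balancing sketch: only overloaded processors shed
# objects, preferring the heaviest object the destination can absorb.

def refine(proc_objs, obj_load, tol=1.05):
    """proc_objs: processor -> list of object names (mutated in place).
    Returns the list of (object, src, dest) migrations performed."""
    def load(p):
        return sum(obj_load[o] for o in proc_objs[p])
    avg = sum(obj_load.values()) / len(proc_objs)
    moves = []
    for p in list(proc_objs):
        while load(p) > tol * avg:
            dest = min(proc_objs, key=load)       # least-loaded processor
            fits = [o for o in proc_objs[p]
                    if load(dest) + obj_load[o] <= tol * avg]
            if dest == p or not fits:
                break                             # cannot improve further
            obj = max(fits, key=obj_load.get)     # heaviest object that fits
            proc_objs[p].remove(obj)
            proc_objs[dest].append(obj)
            moves.append((obj, p, dest))
    return moves

procs = {0: ["a", "b", "c"], 1: ["d"]}
obj_w = {"a": 4.0, "b": 3.0, "c": 3.0, "d": 2.0}
print(refine(procs, obj_w))  # [('a', 0, 1)] -- one migration balances both
```

Because only excess load moves, migration cost stays low even on large machines, which is the main appeal of refinement over full repartitioning.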
Automated Load Balancing Invocation based on Application Characteristics
Abstract: Performance of applications executed on large parallel systems suffers due to load imbalance. Load balancing is required to scale such applications to large systems. However, performing load balancing incurs a cost that may not be known a priori. In addition, application characteristics may change due to the dynamic nature of the application and the parallel system used for execution. As a result, deciding when to balance the load to obtain the best performance is challenging. Existing approaches put this burden on the users, who rely on educated guesses and extrapolation techniques to decide on a reasonable load balancing period, which may be neither feasible nor efficient. In this paper, we propose the Meta-Balancer framework, which relieves application programmers of deciding when to balance load. By continuously monitoring the application characteristics and using a set of guiding principles, Meta-Balancer invokes load balancing on its own, without any prior application knowledge. We demonstrate that Meta-Balancer improves on or matches the best performance obtainable by fine-tuning periodic load balancing. We also show that in some cases Meta-Balancer improves performance by 18% where periodic load balancing gives only a 1.5% benefit.
Keywords: load balancing, automated, parallel, simulation
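A back-of-the-envelope model shows why a good load balancing period exists at all, and why guessing it is error-prone. Assume, purely for illustration (this is analogous to Young's checkpoint-interval formula, not necessarily Meta-Balancer's exact model), that imbalance overhead grows linearly at rate m per step and one load balancing step costs C. Balancing every tau steps then costs an average of m·tau/2 + C/tau per step, which is minimized at tau* = sqrt(2C/m):

```python
# Optimal period under linearly growing imbalance: minimize
# f(tau) = growth_rate * tau / 2 + lb_cost / tau over tau > 0.
import math

def ideal_lb_period(lb_cost, growth_rate):
    """Setting f'(tau) = 0 gives tau* = sqrt(2 * lb_cost / growth_rate)."""
    return math.sqrt(2.0 * lb_cost / growth_rate)

# Example: an LB step costs 2 s; imbalance adds 0.01 s of overhead per step.
print(ideal_lb_period(lb_cost=2.0, growth_rate=0.01))  # 20.0
```

Since both C and m vary with the application phase and the machine, a fixed user-chosen period is rarely right for long, which motivates measuring these quantities continuously and adapting the period at run time.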